1  Summary

In  this  project  we  have  had  two  partial  successes.  First  is  an  efficient  detection  algorithm  for 
objects  in  complex  scenes,  using  very  simple  spatial  arrangements  to  represent  the  objects, 
based  on  local  features  which  are  automatically  identified  in  training.  The  simplicity  of 
the  arrangement  allows  us  to  use  the  Hough  transform  to  very  quickly  find  a  small  number 
of  candidate  locations  for  the  objects.  We  have  also  proposed  a  parallel  architecture  for 
implementing  this  algorithm  with  interesting  biological  analogies.  Second  is  an  algorithm 
for  isolated  object  recognition  using  decision  trees  to  gradually  explore  the  natural  partial 
ordering  of  the  space  of  spatial  arrangements.  The  principles  of  this  algorithm  have  also 
been  successfully  applied  to  the  recognition  of  acoustic  signals. 

1.1  Shape  recognition 

A  shape  recognition  algorithm  has  been  developed  based  on  multiple  randomized  decision 
trees.  The  splits  in  the  trees  are  “queries”  regarding  the  presence  of  partially  invariant 
spatial  arrangements  of  local  features,  anywhere  in  the  image.  These  arrangements  are 
defined  through  pairwise  geometric  relations  between  the  features,  and  can  be  viewed  as 
labeled  graphs.  As  such  they  can  be  arranged  in  a  partial  ordering.  Trees  are  grown  by 
gradually  exploring  this  partial  ordering.  All  data  images  at  a  given  node  have  one  or  more 
instances  of  the  same  arrangement  present;  the  candidate  splits  entertained  at  that  node  are 
only  minimal  extensions  of  the  arrangement  allowing  only  one  additional  local  feature  and 
an  additional  relation. 

Multiple  trees  are  grown  by  choosing  the  best  split  from  among  a  small  random  sample  of 
all  minimal  extensions.  The  terminal  distributions  (conditional  distribution  on  class  at  each 
node)  of  the  trees  are  estimated  from  training  data.  A  test  point  is  classified  by  averaging 
the  terminal  distributions  it  encounters  in  the  trees  and  taking  the  mode.  In  Amit  et  al.
(1997)  the  application  of  this  approach  to  the  recognition  of  handwritten  digits  is  described. 
We  report  a  classification  rate  of  99.2%  on  the  NIST  database.  In  Amit  &  Geman  (1997) 
the  application  of  this  approach  to  the  recognition  of  hundreds  of  shape  classes  is  analyzed, 
together  with  some  theoretical  aspects  of  the  multiple  randomized  tree  algorithm.  This 
theoretical  analysis  is  continued  in  Amit  et  al.  (1999). 
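
The aggregation step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Node` structure, the `leaf` helper, and the toy queries `has_a` and `has_b` are hypothetical stand-ins (real queries ask about the presence of a spatial arrangement of local features), and the terminal distributions are made-up numbers rather than estimates from training data.

```python
from collections import namedtuple

# Hypothetical tree representation: an internal node holds a binary query on
# the image and two children; a terminal node holds a distribution on classes.
Node = namedtuple("Node", ["query", "yes", "no", "dist"])


def leaf(dist):
    """Build a terminal node carrying a class distribution."""
    return Node(query=None, yes=None, no=None, dist=dist)


def terminal_distribution(tree, image):
    """Drop the image down one tree and return the terminal distribution."""
    node = tree
    while node.query is not None:
        node = node.yes if node.query(image) else node.no
    return node.dist


def classify(trees, image):
    """Average the terminal distributions over the trees and take the mode."""
    num_classes = len(terminal_distribution(trees[0], image))
    avg = [0.0] * num_classes
    for tree in trees:
        dist = terminal_distribution(tree, image)
        for c in range(num_classes):
            avg[c] += dist[c] / len(trees)
    best = max(range(num_classes), key=lambda c: avg[c])
    return best, avg


# Toy stand-ins: an "image" is a set of tokens and a query tests membership.
has_a = lambda image: "a" in image
has_b = lambda image: "b" in image
tree1 = Node(has_a, leaf([0.9, 0.1]), leaf([0.2, 0.8]), None)
tree2 = Node(has_b, leaf([0.3, 0.7]), leaf([0.6, 0.4]), None)

label, avg = classify([tree1, tree2], {"a"})
print(label, avg)
```

Averaging the distributions, rather than taking a majority vote over the trees' individual decisions, lets a confident tree outweigh several uncertain ones.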

To summarize, objects are classified in terms of a large pool of very sparse spatial
arrangements of local image features. The pool is efficiently accessed using decision trees.


1.2  Extensions  to  speech  recognition  and  theoretical  analysis 

The  relational  decision  tree  paradigm  has  been  successfully  applied  to  the  recognition  of 
isolated  spoken  digits.  The  features  are  very  simple  functions  of  the  spectrogram,  and  the 
relations are temporal. Trained on relatively small training sets, this approach has achieved
higher recognition rates than state-of-the-art Hidden Markov Model methods, in
the constrained situation of isolated utterances. The issue of combining recognition with
segmentation has yet to be addressed in this context. However, due to the use of multiple
trees and the large degree of invariance incorporated in the relations, the method is very
robust to errors in segmentation. This is demonstrated in Amit & Murua (1999) by testing
on randomly truncated versions of the data.

The  use  of  multiple  classifiers  and  in  particular  multiple  decision  trees  has  become  a 
very  powerful  tool.  See  for  example  Breiman  (1998),  Breiman  (1999),  Schapire  et  al.  (1998), 
Dietterich (1998). There are two complementary methods for creating multiple classifiers.
The first is randomization, for example of the features employed for a split in the tree or of the
architecture of the network. The second is boosting, where higher weights are given to data
points which are misclassified by the current set of classifiers. In Amit et al. (1999) we
attempt  to  provide  a  unifying  explanation  for  the  role  of  these  two  approaches  as  methods 
for  conditional  covariance  minimization  between  the  classifiers.  Boosting  and  randomization 
are  shown  to  both  be  sampling  techniques  from  a  distribution  on  the  space  of  classifiers 
determined  by  the  protocol  and  the  training  set.  Certain  simple  moments  of  this  distribution 
seem  to  determine  the  performance  of  the  aggregate  classifier. 

1.3  Object  detection 

Spatial  arrangements  of  local  image  features  have  also  been  used  for  an  efficient  object 
detection  algorithm.  In  the  previous  proposal  we  suggested  an  approach  to  model  registration 
based  on  decomposable  graphs  of  local  image  features.  The  cliques  of  the  graphs  were 
triangles  and  a  cost  function  was  associated  to  each  triangle  penalizing  its  deviation  from 
the  model  triangle  on  the  template  graph.  All  features  are  found  in  the  image  and  a  dynamic 
programming  algorithm  on  the  decomposable  graph  yields  the  optimal  match.  See  Amit  & 
Kong  (1996)  and  Amit  (1997)  where  the  ideas  are  applied  to  automatic  anatomy  detection. 
In comparison to elastic deformable template methods, this model represents a significant
simplification of the underlying graph and of the associated computation. A sparse
decomposable graph replaces the lattice type graph underlying the elastic models. The
computation changes from continuum based gradient descent type algorithms to discrete
dynamic programming. Due to these simplifications the output of the algorithm consists only
of the match of a small number of model points, not of the entire object; thus we obtain less
information on the instantiation of the object. The graphical models do not require
initialization, and hence provide a crucial initial step for the elastic matching methods. Another
useful attribute of the sparse graphical model is the possibility of explicitly imposing
constraints on certain deformations based on prior knowledge. However, the model still needs to
be constructed by hand, and the algorithm slows down considerably in complex scenes with a
large degree of clutter or confusing background. In addition, the model fails in the presence
of occlusion.

This  motivates  yet  a  further  simplification  of 
the  decomposable  graph,  to  even  simpler  graphs, 
providing  even  simpler  output:  object  location. 

In  Amit  (1998)  the  graph  has  been  reduced  to 
the  simplest  form  possible  while  maintaining  some 
form  of  constraints  on  the  spatial  arrangement  of 
the  local  features:  a  star  type  graph.  All  features 
are  constrained  to  lie  in  certain  regions  relative  to 
a  virtual  center,  see  Figure  1. 

The  regions  are  set  so  that  the  virtual  center  of  any  instantiation  consistent  with  the 
more  complex  graph,  would  be  detected.  It  is  important  that  the  simpler  models  detect 
any  instance  the  more  complex  models  would  detect,  namely  they  should  be  invariant  to 
instantiation  parameters  of  the  more  complex  models  which  are  not  part  of  the  output  of 
the  simpler  model. 

Surprisingly, the statistics of real images are such that using a star type graph with a
moderate number of local features one obtains a very low number of false negatives at the
price of only a few false positives. Moreover, the graph is no longer required to be present
in full as in the decomposable case. Rather, any sufficiently large subset of the features
suffices to declare a detection. This provides substantial robustness to occlusion. The
training stage is fully automatic once the training images of the object are registered to a
fixed scale and location on a reference grid.

A large pool of N local features consisting of flexible edge arrangements is predefined.
Each feature consists of a center edge and several other edges allowed to float in small
regions around the center. Figure 2 shows an example of a definition of a feature with three
edges around the center. The feature is present at a location in the image if the center edge
type is found there and each of the other edges is found anywhere in the respective region
relative to the center.
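
The presence test for such a flexible edge arrangement can be sketched as follows. The data layout is an assumption made for illustration: edge maps are stored as a dictionary from a (hypothetical) edge type name to the set of pixel locations where that edge type was found, and a feature is a center edge type together with a list of (edge type, set of allowed offsets) pairs. The edge types `"h"` and `"v"` and the particular regions are invented for the example.

```python
def feature_present(edges, feature, x, y):
    """True if the center edge is at (x, y) and every other edge of the
    feature is found somewhere in its region relative to the center."""
    center_type, others = feature
    if (x, y) not in edges.get(center_type, set()):
        return False
    for edge_type, offsets in others:
        locations = edges.get(edge_type, set())
        if not any((x + dx, y + dy) in locations for dx, dy in offsets):
            return False
    return True


def detect_feature(edges, feature):
    """Return all image locations at which the feature is present."""
    center_type, _ = feature
    return {(x, y) for (x, y) in edges.get(center_type, set())
            if feature_present(edges, feature, x, y)}


# Hypothetical feature: a center edge of type "h", a "v" edge floating in a
# small region to the right, and another "h" edge floating above.
feature = ("h", [("v", {(1, 0), (1, 1), (2, 0)}),
                 ("h", {(0, 2), (0, 3)})])

# Hypothetical edge maps for one image.
edges = {"h": {(5, 5), (5, 7), (9, 9)},
         "v": {(6, 6)}}

print(detect_feature(edges, feature))
```

Because each non-center edge may lie anywhere in its region, the feature tolerates small local deformations without any explicit search over deformation parameters.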

Fig. 2

In each small region of the reference grid a greedy search is carried out in this pool
for a feature with high frequency (say greater than 50%) on the registered training images.
Typically several tens of locations are thus identified and a fixed size random subset of say
n = 20 is chosen.
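
A minimal sketch of this training step is given below, under several assumptions that are not spelled out in the text: the greedy search is taken to mean accepting the first feature in the pool whose frequency exceeds the threshold, each region of the reference grid is given as a (representative location, set of grid points) pair, and the feature detections on each registered training image are already available as a dictionary from feature index to detected locations. The function name `select_model` and all parameter names are hypothetical.

```python
import random


def select_model(training_detections, regions, pool_size,
                 min_freq=0.5, n=20, seed=0):
    """Pick one frequent feature per region, then a random subset of size n.

    training_detections: one dict per registered training image, mapping a
    feature index to the set of grid locations where it was detected.
    regions: list of (location, set_of_grid_points) pairs.
    """
    num_images = len(training_detections)
    candidates = []
    for location, region in regions:
        for f in range(pool_size):
            # Count training images with feature f somewhere in the region.
            hits = sum(1 for det in training_detections
                       if det.get(f, set()) & region)
            if hits / num_images > min_freq:
                candidates.append((f, location))
                break  # assumed greedy rule: keep the first frequent feature
    if len(candidates) > n:
        candidates = random.Random(seed).sample(candidates, n)
    return candidates


# Toy data: four registered training images, two regions, a pool of 3 features.
regions = [((0, 0), {(0, 0), (0, 1)}),
           ((5, 5), {(5, 5)})]
training_detections = [
    {1: {(0, 1)}, 2: {(5, 5)}},
    {1: {(0, 0)}},
    {1: {(0, 0)}, 2: {(5, 5)}},
    {0: {(0, 0)}},
]
model = select_model(training_detections, regions, pool_size=3)
print(model)
```

In the toy data, feature 1 occurs in the first region on three of four images and is kept, while no feature exceeds 50% in the second region, so that region contributes nothing to the model.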

The collection of identified feature/location pairs is the object representation. The
associated graph has an edge between each feature location $z$ and the center of the
reference grid. The edges are labeled by the region $B_z$ in which the feature is allowed to
float around the model location. This region is determined, for example, by the range of
scales and rotations at which the object is to be detected, as well as other deformations
allowed by the more complex models. See Figure 1 (right). Figure 3 shows some of the
features identified for faces at their locations on the reference grid. The center edge of
each feature defines its ideal location.

Given an image, all instances of each of the model features are identified. A search for
virtual centers in the image where a sufficiently large number, say r, of model features is
present, each within its respective region, is implemented using a generalized Hough
transform. This is ideally suited for detecting the centers of such star type graphs. Each
detected feature ‘votes’ for a region of centers consistent with the location of the feature
in the model. In Figure 4 we show how any subgraph of size 3 of the star graph of Figure 1
is found.

Fig. 4
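
The voting step can be sketched as follows. This is an illustrative reading of the procedure, not the report's implementation: the model is assumed to be a list of (feature index, model offset from the center, region of allowed displacements) triples, detections are a dictionary from feature index to image locations, and `min_votes` is a hypothetical name for the number of model features required. Each model feature is counted at most once per candidate center.

```python
from collections import defaultdict


def hough_centers(detections, model, min_votes):
    """Return {center: number of model features found} for all centers
    supported by at least min_votes distinct model features."""
    votes = defaultdict(set)  # candidate center -> indices of model features
    for j, (f, z, region) in enumerate(model):
        for (yx, yy) in detections.get(f, ()):
            # A detection at y supports every center x with y in x + z + region.
            for (bx, by) in region:
                votes[(yx - z[0] - bx, yy - z[1] - by)].add(j)
    return {c: len(js) for c, js in votes.items() if len(js) >= min_votes}


# Hypothetical star model with three features and a 3x3 displacement region.
box = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
model = [(0, (-3, 0), box), (1, (3, 0), box), (2, (0, 4), box)]

# Object centered near (10, 10): feature 1 is slightly displaced, feature 2 is
# occluded, and a clutter instance of feature 2 appears far away.
detections = {0: [(7, 10)], 1: [(13, 11)], 2: [(30, 30)]}

centers = hough_centers(detections, model, min_votes=2)
print(sorted(centers))
```

The cost is proportional to the number of detections times the region size, independent of how many centers are examined, which is what makes the star graph so much cheaper than matching a more complex graph at every location.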

Scale and rotation are subsequently identified at the candidate locations. Other more
intensive graph matching computations, including elastic template matching, can be carried
out as well. These serve both to provide additional information on the instantiation of the
object and to filter out false positives: if the match of the complex model has
too high a cost, the detection is rejected. To find the object at significantly larger scales
the image is reprocessed at several lower resolutions. This approach has been successfully
implemented for face detection, symbol detection, and detection of 2D views of rigid objects.
See Figure 5 for face detections, where scale and rotation are estimated and employed as a
rejection mechanism. Computation time on a standard Pentium II is around 1.5 seconds for
a 240x320 image processed at 6 resolutions. Figure 6 shows ‘paper binder’ detections with
the same procedure.


Fig.  3 


Fig.  6

The idea of learning more complex features and then representing an object in terms of a
graph describing their geometric arrangement can be found in Burl et al. (1995), Wiskott
et al. (1997), Cootes & Taylor (1996). In these approaches the features, or certain relevant
parameters, are also identified through training. One clear difference, however, is that the
approach presented here makes use only of binary features with hardwired invariances and well
understood statistical properties, and employs a very simple form of spatial arrangement for
the object representation. This leads to efficient implementations of the detection algorithm.

1.4  Neural  network  implementations 

An  attractive  feature  of  the  algorithm  outlined  above  for  training  and  detecting  objects  is 
the  possibility  of  implementing  it  in  a  neural  network  with  biologically  plausible  architecture, 
computation,  and  learning  mechanisms.  In  Amit  (1998)  we  propose  a  network  which  involves 
only  binary  neurons  and  is  capable  of  detecting  the  object  anywhere  in  the  visual  field  by 
implementing  the  Hough  transform  for  any  object  representation  evoked  in  a  central  memory 
module. The network does not change its architecture for detecting new objects; it employs
top-down priming in order to direct the bottom-up flow of information (edge and local
feature  maps)  in  such  a  way  that  the  Hough  transform  is  computed  in  a  simple  retinotopic 
summation  layer. 

Define a module $M_{glob}$ which has one unit corresponding to each feature/location pair.
An object representation consists of a small collection of $n$ such pairs
$(\alpha_{i_j}, z_j)$, $j = 1, \ldots, n$. For simplicity assume $n$ is the same for all
objects, and that the number $r$ needed for a detection is the same for all objects as well.
An object representation is a simple binary pattern in $M_{glob}$, with $n$ 1's and
$(N - n)$ 0's. Instances of each of the $N$ local features $\alpha_i$ are detected in an
array $F_i$. For each $i$, introduce a system of arrays $Q_{i,z}$ indexed by the locations
$z$ in the reference grid $G$. These $Q$ arrays lay the ground for the detection of any
object representation by performing the ‘voting’ step of the Hough transform. Thus a unit
at location $x \in Q_{i,z}$ receives input from a region $x + B_z$ in $F_i$.

Note that for each unit $u = (\alpha_i, z)$ in $M_{glob}$ there is a corresponding
$Q_{i,z}$ array. All units in $Q_{i,z}$ receive input from $u$. This is where the top-down
flow of information is achieved: in order to be activated a unit $x$ in $Q_{i,z}$ needs both
$(\alpha_i, z)$ and some unit in the region $x + B_z$ in $F_i$ to be activated. Thus the
representation evoked in $M_{glob}$ primes the appropriate $Q_{i,z}$ arrays to a point where
they could be activated if the appropriate input comes from below, i.e. the $F_i$ array. The
system of $Q$ arrays sums into an array $S_{glob}$. A unit at location $x \in S_{glob}$
receives input from all $Q_{i,z}$ arrays at location $x$ and is on if
$\sum_{i=1}^{N} \sum_{z \in G} Q_{i,z}(x) > r$. The array $S_{glob}$ therefore shows those
locations for which there are more than $r$ votes in the Hough transform. In the figure below
we provide a graphic representation of this net. A more sophisticated network involving
adaptable local feature detectors is also suggested in Amit (1998), as well as some
interesting biological analogies.
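
The gating of bottom-up input by top-down priming can be illustrated with binary arrays represented as sets of active units. This is a sketch under assumptions, not the network of Amit (1998): the square neighborhood returned by `box` is a hypothetical choice of the region around a model location, `min_votes` is a hypothetical name for the summation threshold, and the toy object uses four feature/location pairs of which three must be present.

```python
def box(z, radius=1):
    """Hypothetical region B_z: a square neighborhood of model location z."""
    return {(z[0] + dx, z[1] + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)}


def q_unit(x, i, z, m_glob, f_arrays):
    """Unit x of array Q_{i,z}: on only if primed from above by unit (i, z)
    in M_glob AND driven from below by some unit of F_i in x + B_z."""
    if (i, z) not in m_glob:
        return 0  # no top-down priming, so bottom-up input is ignored
    active = f_arrays.get(i, set())
    return int(any((x[0] + bx, x[1] + by) in active for bx, by in box(z)))


def s_glob(grid, m_glob, f_arrays, min_votes):
    """Summation layer: locations receiving at least min_votes active Q units.
    Unprimed Q arrays are all zero, so summing over the pairs active in
    M_glob equals summing over all i and all z in G."""
    return {x for x in grid
            if sum(q_unit(x, i, z, m_glob, f_arrays)
                   for (i, z) in m_glob) >= min_votes}


# Object representation evoked in M_glob: four feature/location pairs.
m_glob = {(0, (-3, 0)), (1, (3, 0)), (2, (0, 4)), (3, (0, -4))}

# Bottom-up detections: features 0, 1, 2 are consistent with a center at
# (10, 10); feature 3 appears only at an inconsistent location.
f_arrays = {0: {(7, 10)}, 1: {(13, 10)}, 2: {(10, 14)}, 3: {(25, 25)}}

grid = [(x, y) for x in range(32) for y in range(32)]
on = s_glob(grid, m_glob, f_arrays, min_votes=3)
print(sorted(on))
```

Detecting a different object only requires a different pattern in `m_glob`; the functions, which stand in for the fixed wiring, are unchanged.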


In this example the object is defined in terms of four feature/location pairs
$(\alpha_{i_j}, z_j)$ on the reference grid, each coded by a unit in $M_{glob}$. Three of
the four have to be present to have a detection. Each unit provides input to all units in the
corresponding $Q_{i,z}$ array (thick lines). The locations of the bottom-up feature
detections are shown as •'s on the $F_i$ arrays. They provide input to ‘displaced’ regions in
the $Q$ arrays shown as thick lines. The regions are defined in terms of neighborhoods $B_z$
of the model location $z$. Only locations on the $Q$ arrays which receive input both from the
$F$ arrays and from a model unit in $M_{glob}$ (double thick lines) are actually on and
contribute to the summation into $S_{glob}$. Note that instances of one of the features do
not contribute to the detection since they are not present in the correct location relative
to the others. There are $N$ systems of $F$, $Q$ arrays, one for each local feature $\alpha$.


To  our  knowledge  there  is  no  alternative  network  in  the  literature  for  translation  invariant 
detection  of  objects  which  is  not  wired  for  a  specific  object  representation.  In  Amit  & 
Mascaro  (1999)  we  describe  a  neural  network  architecture  which  employs  local  Hebbian 
learning  both  to  achieve  object  recognition  and  to  create  object  representations  which  can 
drive  the  detection  network. 
networks:  an  application  to  character  recognition,  Technical  report,  Department  of
Statistics,  University  of  Chicago. 

10.  Amit,  Y.  &  Murua,  A.  (1999),  Speech  recognition  using  randomized  relational  decision 
trees,  Technical  report,  Department  of  Statistics,  University  of  Chicago. 

11.  Amit,  Y.,  Blanchard,  G.  &  Wilder,  K.  (1999),  Multiple  randomized  classifiers:  MRCL, 
Technical  report,  University  of  Chicago. 

12.  Amit,  Y.  (2000),  A  neural  network  architecture  for  visual  selection,  Neural  Computation,
to  appear.
3  Scientific  Personnel


Graduate  students 

•  Steve  Wang,  Ph.D.  1998.

•  Gilles  Blanchard  -  1998-1999  (Visiting  student  from  Ecole  National  Superieur). 

Post-docs 

•  Bruno  Jedynak,  1996-1997.

•  Ken  Wilder,  1997-1999. 

•  Massimo  Mascaro,  1999- 